Current status of artificial intelligence analysis for the treatment of pancreaticobiliary diseases using endoscopic ultrasonography and endoscopic retrograde cholangiopancreatography

Abstract
Pancreatic and biliary diseases encompass a range of conditions requiring accurate diagnosis for appropriate treatment strategies. This diagnosis relies heavily on imaging techniques like endoscopic ultrasonography and endoscopic retrograde cholangiopancreatography. Artificial intelligence (AI), including machine learning and deep learning, is becoming integral in medical imaging and diagnostics, such as the detection of colorectal polyps. AI shows great potential in diagnosing pancreatobiliary diseases. Unlike machine learning, which requires feature extraction and selection, deep learning can utilize images directly as input. Accurate evaluation of AI performance is a complex task due to varied terminologies, evaluation methods, and development stages. Essential aspects of AI evaluation involve defining the AI's purpose, choosing appropriate gold standards, deciding on the validation phase, and selecting reliable validation methods. AI, particularly deep learning, is increasingly employed in endoscopic ultrasonography and endoscopic retrograde cholangiopancreatography diagnostics, achieving high accuracy levels in detecting and classifying various pancreatobiliary diseases. The AI often performs better than doctors, even in tasks like differentiating benign from malignant pancreatic tumors, cysts, and subepithelial lesions, identifying gallbladder lesions, assessing endoscopic retrograde cholangiopancreatography difficulty, and evaluating biliary strictures. The potential for AI in diagnosing pancreatobiliary diseases, especially where other modalities have limitations, is considerable. However, a crucial constraint is the need for extensive, high‐quality annotated data for AI training. Future advances in AI, such as large language models, promise further applications in the medical field.


INTRODUCTION
inflammatory lesions such as chronic pancreatitis and autoimmune pancreatitis (AIP). [1][2][3][4] Similarly, biliary diseases include neoplastic lesions, such as cholangiocarcinoma, and inflammatory conditions, such as primary sclerosing cholangitis and immunoglobulin 4-related sclerosing cholangitis. Accurate diagnosis before treatment is essential, as treatment strategies differ significantly for each disease. Various imaging modalities such as computed tomography, magnetic resonance imaging, abdominal ultrasonography, endoscopic ultrasonography (EUS), and endoscopic retrograde cholangiopancreatography (ERCP) are used to diagnose diseases in the hepatopancreatobiliary region. High-resolution images of the pancreas and biliary tract can be obtained using EUS, an important modality for treating pancreatobiliary diseases. 5 Procedures such as contrast-enhanced EUS (CE-EUS), EUS-guided fine needle aspiration/biopsy (EUS-FNA/B), and EUS-elastography enhance the diagnostic performance of EUS. [6][7][8][9][10][11] However, EUS alone cannot diagnose all pancreatobiliary diseases because of its low specificity, even when using EUS-related procedures (with 80%-95% accuracy). 12 ERCP is also used for the diagnosis of pancreatic and biliary tract diseases and enables simultaneous interventions such as stone removal and bile duct stenting. However, it might result in severe adverse events such as post-ERCP pancreatitis. 13 Artificial intelligence (AI) is a mathematical classification or regression technique, while "deep learning" is an AI algorithm and an advanced machine learning method that uses neural networks. 14 During the past decade, AI has made dramatic progress and has been applied in the medical field, including for the diagnosis of pancreatobiliary diseases using numerous types of modalities. 11,12,[15][16][17][18][19] However, most of the associated reports have not been systematically categorized.
This review covers two topics: 1) a simple checklist for evaluating AI performance and 2) the current status of AI for EUS and ERCP, especially in relation to pancreatobiliary diseases. However, this article is neither a systematic review nor a meta-analysis, as the published databases were not searched systematically in the manner required for a meta-analysis.

MACHINE LEARNING AND DEEP LEARNING
Several AI architectures exist, including machine learning (ML) and deep learning (DL). Although DL is a subset of ML, there is a clear difference between the two in terms of whether feature extraction (such as texture analysis and histogram analysis) and feature selection (such as the filter method and wrapper method) are performed during the preprocessing stage. DL does not require feature extraction and selection because it can directly use images as input values. 14,20 Various architectures have been employed to develop ML, including support vector machines, decision trees, random forests, factorization machines, logistic regression analyses, and neural networks (NN). 21,22 Gradient boosting machines are an evolution of random forests. For DL, convolutional NNs or transformer architectures are utilized, and DL is generally chosen when images are used as input values. In comparison, radiomics generally refers to the process of performing feature extraction and selection on images, which are then input into ML rather than DL. 23 The primary roles of AI in the medical field include imaging diagnosis (so-called computer-aided diagnosis) and lesion detection (so-called computer-aided detection). Computer-aided detection systems can be further classified into object detection and image segmentation.
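The feature extraction step that separates conventional ML from DL can be illustrated with a minimal sketch of histogram analysis. The function name `histogram_features`, the bin count, and the choice of first-order statistics are illustrative assumptions, not the method of any study cited here; a DL model would instead receive the pixel values directly.

```python
from statistics import fmean, pstdev


def histogram_features(pixels, bins=8, max_value=255):
    """Hand-crafted first-order features from a flat list of grayscale pixels.

    Hypothetical helper: in a conventional ML pipeline, features like these
    (not the raw image) are passed to a classifier such as a random forest.
    """
    n = len(pixels)
    mean = fmean(pixels)
    std = pstdev(pixels)

    # Normalized intensity histogram with equally wide bins over 0..max_value.
    counts = [0] * bins
    for p in pixels:
        idx = min(int(p * bins / (max_value + 1)), bins - 1)
        counts[idx] += 1
    hist = [c / n for c in counts]

    # Skewness of the intensity distribution (defined as 0 for a flat region).
    if std == 0:
        skew = 0.0
    else:
        skew = fmean([((p - mean) / std) ** 3 for p in pixels])

    return {"mean": mean, "std": std, "skewness": skew, "histogram": hist}


# Toy region of interest: half dark, half bright pixels.
features = histogram_features([0, 0, 255, 255], bins=2)
```

Feature selection (filter or wrapper methods) would then keep only the most informative of these values before model training.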

EVALUATION METHOD OF AI PERFORMANCE
There are numerous articles on medical AI, but their findings differ. In addition, unlike general clinical research, AI studies involve many specific terms, evaluation methods, metrics, and development phases. These factors can make it difficult to evaluate AI performance appropriately. Guidelines such as the Standards for Reporting Diagnostic Accuracy Studies (STARD) and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) are important references; however, because these guidelines can be somewhat complex, it is useful to consider a few checkpoints (design, input value, data volume, model, and metrics) when evaluating AI performance. 24,25

Design
First, we need to confirm the type of AI developed: for classification, detection, or other purposes. The gold standard for the labels of each image should be checked. A pathological diagnosis or commonly used diagnostic criteria are desirable as the gold standard; however, in benign diseases where the pathological diagnosis is difficult, a combination of pathological findings and clinical observations (no malignant findings on biopsy and no change with follow-up) is acceptable. It is important to confirm whether the labels are for binary or multiclass classification and to check the definition of each label. For binary classification, it is important to confirm the definition of the control group. Next, we need to confirm the validation phase (internal or external). The recommended approach for external validation is to randomly divide the collected data into training and validation sets and collect a test set from another facility after developing the model (Figure 1a). If it is difficult to collect a separate test set, another option is to perform split-sample validation by randomly dividing the collected data into three sets: training, validation, and test (Figure 1b). However, this approach does not strictly qualify as external validation but rather as internal validation. Therefore, employing a method called temporal validation is preferable, in which the test set is distinguished from the training and validation sets by setting a specific time period (Figure 1c). 24,25
FIGURE 1 Data splitting method during artificial intelligence (AI) development and validation. (a) External validation. The development data are split into training data and validation data. External validation data are collected from a cohort independent of the development data (e.g., data from other institutions or data collected after AI development). (b) Split-sample validation. The collected data are randomly divided into training, validation, and test sets. (c) Temporal validation. All data are divided into development and test data by period. Development data are randomly divided into training and validation data. (d) Internal validation (hold-out method). All data are randomly divided into training and validation data. (e) Internal validation (k-fold cross-validation). k-fold cross-validation divides the dataset into k groups. One group is used for validation, while the others are used for training the model. This process is repeated k times, and the results are averaged. (f) Internal validation (leave-one-out validation). One case of the cohort is assigned to the validation group and all others to the training group; after training and validation, the same procedure is repeated until every case has served as the validation group, and the validation results of all cases constitute the final result.
Subsequently, the validation methods (cross-validation or holdout) should be confirmed during the internal validation phase (Figure 1d). Several types of cross-validation methods exist, including leave-one-out and K-fold cross-validation. K-fold cross-validation is a method in which the entire dataset is divided into K smaller groups, or folds. One group is used as the validation set, and the remaining groups are used as the training set to build the model. This process is repeated K times, with each group serving as the validation set once. The results are obtained by averaging the validation results from all the iterations (Figure 1e). In leave-one-out, each data point in the dataset is used as the validation set once, whereas the remaining data points are used for training. This means that if there are 'n' data points, the model is trained and validated 'n' times (Figure 1f). 24,25
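The K-fold and leave-one-out procedures described above can be sketched as index-splitting routines. This is a minimal illustration under stated assumptions: the function names `k_fold_splits` and `leave_one_out_splits` and the fixed random seed are hypothetical, and the model training and scoring that would happen inside the loop are omitted.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for K-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def leave_one_out_splits(n_samples):
    """Leave-one-out is K-fold cross-validation with K equal to the sample count."""
    return k_fold_splits(n_samples, n_samples)


# Each of the 10 samples appears in exactly one validation fold.
for train, validation in k_fold_splits(10, k=5):
    pass  # train a model on `train`, score it on `validation`, then average the scores
```

When several images come from one patient, the indices passed to such a routine should refer to cases rather than images, so that a single case never falls on both sides of a split.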

Input value and models
The type of input values (image, clinical features, or a combination of these) should be confirmed. Next, it is necessary to determine whether the method is based on DL or ML. In the case of ML, it is important to consider the feature extraction methods, such as histograms or texture analysis. 24,25

Data volume
The inclusion criteria for the target cases and the method used for data splitting should be verified. DL often requires more training data than ML (DL > 1000, ML > 100). 5,20 To prevent data leakage (a phenomenon in which information from the training group seeps into the validation/test groups, making the model seem more accurate than it is), all data from the same case should be placed in the same split group when the data are divided.
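A case-level split that avoids this kind of leakage might look as follows. It is a sketch only: the function name `split_by_case`, the 20% validation fraction, and the toy case identifiers are assumptions made for illustration.

```python
import random


def split_by_case(image_case_ids, val_fraction=0.2, seed=0):
    """Split image indices so that all images of one case share the same group.

    `image_case_ids[i]` is the (hypothetical) case identifier of image i.
    Returns (train_indices, validation_indices).
    """
    cases = sorted(set(image_case_ids))
    random.Random(seed).shuffle(cases)

    # Choose whole cases, not individual images, for the validation group.
    n_val = max(1, round(len(cases) * val_fraction))
    val_cases = set(cases[:n_val])

    train_idx = [i for i, c in enumerate(image_case_ids) if c not in val_cases]
    val_idx = [i for i, c in enumerate(image_case_ids) if c in val_cases]
    return train_idx, val_idx


# Nine images from five cases; no case contributes to both groups.
case_ids = ["A", "A", "B", "B", "B", "C", "D", "D", "E"]
train_idx, val_idx = split_by_case(case_ids, val_fraction=0.4)
```

Splitting at the image level instead would let near-identical frames from one examination appear in both groups, which tends to inflate the reported accuracy.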

Metrics
Subsequently, the evaluation metrics of the model should be assessed. For classification tasks (such as disease diagnosis), the accuracy, sensitivity, specificity, and area under the curve from receiver operating characteristic (ROC) analysis can be used. In this case, describing the results using a confusion matrix is desirable. For detection tasks, performance can be evaluated using metrics such as intersection over union (IoU) or the Dice score. Generally, successful detection is defined as IoU ≥ 0.5. 26 The IoU and confidence scores are used to classify true positives (correctly detected), false positives (misdetected), and false negatives (detection failed). Unlike in classification problems, true negatives (non-lesion part not detected) cannot be theoretically calculated because there are countless negative areas. Therefore, detection evaluation metrics often include sensitivity (recall), positive predictive value (precision), the area under the precision-recall curve (the average precision), and the mean average precision across classes. Theoretically, specificity, negative predictive value, and accuracy are not used (although calculations are sometimes performed by considering images without lesions as negative samples). 5,20 Metrics such as the mean IoU across classes are used for segmentation tasks. Essentially, the metrics of the external validation group are the primary endpoints of the report. In the case of internal validation, the results of the validation group in the holdout method and the results of all cases in cross-validation become the primary endpoints.
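The classification and detection metrics named above can be written out in a short sketch. The function names, the `(x1, y1, x2, y2)` box format, and the example counts are illustrative assumptions; they do not reproduce data from any cited study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics derived from the four cells of a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
    }


def box_overlap(a, b):
    """Return (IoU, Dice) for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    iou = inter / union if union else 0.0
    dice = 2 * inter / (area_a + area_b) if (area_a + area_b) else 0.0
    return iou, dice


def is_detected(predicted_box, true_box, threshold=0.5):
    """Apply the common convention that IoU >= 0.5 counts as a successful detection."""
    iou, _ = box_overlap(predicted_box, true_box)
    return iou >= threshold


metrics = classification_metrics(tp=9, fp=3, fn=1, tn=7)
iou, dice = box_overlap((0, 0, 2, 2), (1, 0, 3, 2))
```

For the same pair of regions the Dice score is never lower than the IoU, so thresholds stated for one metric should not be applied to the other without conversion.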

AI FOR EUS IMAGES
While referring to existing articles on AI in the field of biliary and pancreatic medicine, the PubMed, Embase, and Cochrane databases were systematically searched for articles published from inception to March 31, 2023, by one author (Takamichi Kuwahara). 5,20,[27][28][29] The search terms used were as follows: (artificial intelligence OR deep learning OR machine learning OR radiomics) AND (endoscopic ultrasonography OR endoscopic ultrasound OR EUS OR ERCP OR cholangioscopy). The findings of these articles indicated that AI for EUS has been developed for numerous purposes, such as classifying and detecting pancreatic tumors, pancreatic cysts, and submucosal tumors.

Detection of pancreatic tumors
One article on the AI detection of pancreatic tumors from EUS images was reported by Tonozuka et al. (Table 1). 30 They developed an AI to detect pancreatic tumors from EUS images using a fully convolutional network based on a convolutional NN with EUS images from 93 cases of pancreatic disease. When its accuracy was evaluated using 47 test data, they reported an area under the ROC curve of 0.94.
FIGURE 2 Artificial intelligence image for differential diagnosis of pancreatic masses. An endoscopic ultrasonography image (solid-pseudopapillary neoplasm) is used for the diagnosis of carcinoma by artificial intelligence. The probability for a single endoscopic ultrasonography image is displayed in the upper left, and the AI diagnoses this lesion as non-carcinoma.

Classification of pancreatic tumors
Twelve articles have been published on AI for classifying pancreatic tumors from EUS images (Table 1). 12,[31][32][33][34][35][36][37][38][39][40][41] Săftoiu et al. extracted features by performing a histogram analysis using EUS-elastography images of 258 pancreatic cancer or chronic pancreatitis cases and created an AI using NN. They conducted 10-fold cross-validation to evaluate its accuracy and reported an accuracy of 0.843. 37,38 Kuwahara et al. created an AI for diagnosis using EfficientNetV2-L, one of the classification architectures of DL, with EUS images from 772 cases of multiple pancreatic diseases such as pancreatic cancer, AIP, neuroendocrine tumors (NET), solid-pseudopapillary neoplasm, and chronic pancreatitis. They evaluated its accuracy using 161 test data and reported an accuracy of 0.91 (Figure 2). 12 Marya et al. created an AI using ResNet50v2, one of the classification architectures of DL, with EUS images from 460 cases of pancreatic diseases, such as pancreatic cancer, AIP, and chronic pancreatitis. They evaluated its accuracy using 123 test data and reported a sensitivity of 0.9 and specificity of 0.85 (AIP vs. others). 39 Naito et al. conducted a study in which they annotated pathological images obtained from EUS-FNA specimens of 372 cases of pancreatic cancer. They used EfficientNet, another classification architecture of DL, for a differential diagnosis of pancreatic cancer. They evaluated the accuracy of the model using 120 external validation data and reported an accuracy of
TABLE 1 Main characteristics of included studies about endoscopic ultrasonography-artificial intelligence (EUS-AI) for pancreatic tumors.

Classification of pancreatic cysts
Five articles on AI detection of pancreatic tumors from EUS images have been reported ( cytology, clinical information, and blood test results from 85 cases of pancreatic cysts. 43 They conducted 5-fold cross-validation to evaluate its accuracy and reported an accuracy of 0.93. The accuracy of the AI was significantly higher than that of carcinoembryonic antigen in cystic fluid and cytology. Machicado et al. created an AI to differentiate benign from malignant intraductal papillary mucinous neoplasms (IPMN) using VGG16 and Faster R-CNN (a faster region-based convolutional NN), classification and segmentation architectures of DL, with EUS-guided needle-based confocal laser endomicroscopy images from 35 cases of IPMN. They conducted 5-fold cross-validation to evaluate its accuracy and reported an accuracy of 0.85. The diagnostic ability of the AI was higher than that of the high-risk features defined in guidelines (accuracy, 0.68-0.74). 44

Detection of pancreas parenchyma
Zang et al. created an AI to detect pancreatic parenchyma using UNet++, a segmentation architecture of DL, with 294 EUS images as input values. When they evaluated its accuracy using 20 external validation data, they reported a sensitivity (recall) of 0.984 and a positive predictive value (precision) of 0.824 (Table 2). In the same report, they also created an AI to differentiate  Table 2). The accuracy of this model was evaluated using 83 sets of external validation data and was found to be 0.762. 46

Classification of subepithelial lesions
Seven articles on AI classification of subepithelial lesions (SELs) from EUS images have been published (Table 2). [47][48][49][50][51][52][53] Minoda et al. created an AI to differentiate gastrointestinal stromal tumors from leiomyomas using Xception, another classification architecture of DL, with EUS images of 173 cases of SELs. They evaluated its accuracy using 60 test data and reported an accuracy of 0.86 (for lesions <20 mm) and 0.9 (for lesions ≥20 mm). The diagnostic ability of the AI was higher than that of experts and human doctors (accuracy, 0.53-0.73). 48 Hirai et al. created an AI to diagnose multiclass SELs (gastrointestinal stromal tumors, leiomyoma, schwannoma, NET, and ectopic pancreas) using EfficientNetV2-L, one of the classification architectures of DL, with EUS images from 509 cases of SELs. They evaluated its accuracy using 122 test data and reported an accuracy of 0.86 (Figure 4). The diagnostic ability of the AI was higher than that of both expert and non-expert human doctors (accuracy, 0.58). 49

AI FOR ERCP
AI for ERCP has been developed for numerous purposes, such as evaluating the difficulty of ERCP using endoscopic images of the duodenal papilla and classifying biliary strictures as benign or malignant using clinical features or cholangioscopy images (Table 3).

Evaluation of the difficulty of ERCP
Two articles on AI evaluation of the difficulty of ERCP have been published (Table 3). 54,55 Huang et al. developed an AI system using CasNet, a segmentation architecture of DL, trained on 1381 cholangiogram images. This AI could detect common bile duct stones. The researchers assessed its accuracy using 228 test data and reported a sensitivity (recall) of 0.67 and a positive predictive value (precision) of 0.8. By leveraging this AI, they established a difficulty-scoring system for stone removal (categorized as difficult or not difficult). They validated the performance of their system using 173 data. The mechanical lithotripsy rate, treatment time, and stone removal failure rate were significantly higher in cases the AI identified as "difficult" than in those predicted to be "not difficult." 54

Classification of benign-malignant diagnosis of biliary strictures
Sugimoto et al. created an AI to diagnose the malignancy of biliary strictures using LightGBM, a classification architecture of ML, based on clinical and imaging features, such as bile duct diameter, from 206 cases with biliary strictures. They conducted 5-fold cross-validation to evaluate its accuracy and reported an accuracy of 0.86 (Table 3). 56 Marya et al. created an AI to differentiate benign from malignant biliary strictures using ResNet50v2, one of the classification architectures of DL, with cholangioscopy images from 122 cases of biliary strictures. The researchers assessed its accuracy using 32 test data and reported an accuracy of 0.9 (Table 3). 57

LIMITATIONS AND FUTURE PERSPECTIVE ON AI FOR EUS AND ERCP
In this review, we evaluated AI for EUS and ERCP and found that it has been developed for numerous purposes. A number of useful reports have appeared sporadically; however, few studies have performed an accurate external validation, resulting in a limited number of reports with high levels of evidence. There are currently no approved biliary or pancreatic AI systems in Japan. AI for EUS and ERCP is not as advanced as that for plain endoscopy. One of the major obstacles to the development of AI algorithms for EUS and ERCP is the limited availability of high-quality annotated data. Training AI models requires large datasets encompassing diverse cases; however, EUS and ERCP datasets may be more limited than endoscopy datasets. To overcome this limitation, developing a nationwide system that collects and utilizes EUS and ERCP images is necessary. AI for EUS and ERCP has been developed only for diagnostic imaging and detection. In recent years, large language models such as ChatGPT have emerged, and the latest models can use images and audio as both input and output. Therefore, their regular application in the medical field is anticipated in the future.

CONCLUSION
This review reported the current status, limitations, and future perspectives of AI for EUS and ERCP. AI has the potential to be a breakthrough in the diagnosis of pancreatobiliary diseases where other modalities have diagnostic limitations.

CONFLICT OF INTEREST STATEMENT
Takamichi Kuwahara, Kazuo Hara, Shin Haba, Nozomi Okuno, Toshitaka Fukui, Minako Urata, and Yoshitaro Yamamoto declare no conflict of interest related to this study. Nobumasa Mizuno has received grants or contracts to their institution from Novartis, MSD, Incyte, Ono Pharmaceutical, Seagen, Dainippon Sumitomo Pharma; has received payment or honoraria for lectures, presentations, speakers bureaus, manuscript writing or educational events from Yakult Honsha, AstraZeneca, Novartis, FUJIFILM Toyama Chemical, MSD, Taiho Pharmaceutical; and has participated on a Data Safety Monitoring Board or Advisory Board for AstraZeneca.

ETHICS STATEMENT
This article does not contain any study with human or animal subjects performed by any of the authors.